Along with Transparent Proxy, an exciting minor feature lands in the Consul 1.10 release, and I’m happy to say it’s been on my wish list since HashiCorp entered the service mesh space by adding Connect to Consul. Since then, a lot of options have emerged across the mesh industry for secure, opaque connectivity between services and orchestrated containers. In my mind this feature helps get services and containers back on track where they went wrong almost 12 years ago with LXC and eventually Docker. I’m talking about UNIX domain sockets, or UDS.
If you think back to the first time you tried Docker or LXC, it was remarkable to get an isolated environment up in seconds, with its own filesystem and IP address, without needing to start a VM and an OS. Legacy chroots never isolated networking. If two developers need to run services on port 443, just give them each a container with its own IP and let them use whatever ports they want via network namespaces and virtual interfaces. You might compare your container host to a virtual coffee shop with a wifi router performing NAT on its single public IP, giving everybody a private IP and masquerading their outgoing requests. This seems simple enough, but what if you need to send a message directly to someone in the coffee shop next door, which has its own network? What if someone needs to send your IP a message or ask a question from outside of any coffee shop? Imagine the physical analogue, where you need to grab a pen and fill out a TCP/IP header form to communicate just a ~1500 byte MTU (Maximum Transmission Unit) segment of your message to someone else in another network. Each form would look something like this:
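In code terms, that “form” is just a fixed header layout repeated for every segment. Here is a minimal sketch in Python that packs an illustrative IPv4 header and TCP header by hand; the addresses, ports, and sequence numbers are made up, and the checksums are left at zero rather than computed.

```python
import struct

def ipv4_header(src: str, dst: str, payload_len: int) -> bytes:
    """Pack a 20-byte IPv4 header (no options; checksum left at 0 for brevity)."""
    version_ihl = (4 << 4) | 5            # IPv4, 5 x 32-bit words
    total_length = 20 + 20 + payload_len  # IP header + TCP header + data
    src_bytes = bytes(int(o) for o in src.split("."))
    dst_bytes = bytes(int(o) for o in dst.split("."))
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, total_length,
        0x1234,      # identification
        0,           # flags / fragment offset
        64,          # TTL
        6,           # protocol: TCP
        0,           # header checksum (normally computed per packet)
        src_bytes, dst_bytes,
    )

def tcp_header(sport: int, dport: int) -> bytes:
    """Pack a 20-byte TCP header (no options; checksum left at 0 for brevity)."""
    return struct.pack(
        "!HHIIBBHHH",
        sport, dport,
        1000,        # sequence number (illustrative)
        0,           # acknowledgment number
        5 << 4,      # data offset: 5 x 32-bit words
        0x02,        # flags: SYN
        65535,       # window size
        0,           # checksum (normally computed per packet)
        0,           # urgent pointer
    )

# Two made-up container IPs in different subnets, and two made-up ports.
form = ipv4_header("10.0.0.2", "10.0.1.3", payload_len=0) + tcp_header(52000, 443)
print(len(form))  # 40 bytes of "form" before a single byte of your message
```

Forty bytes of bookkeeping per segment, filled out and checksummed again at every hop that rewrites the packet.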
On WiFi with poor connection quality this might be helpful, since any packet with a detected defect must be retransmitted in its entirety via TCP retry, but that’s not normally an issue between most servers. For background on Consul traffic routing, Christoph Puhl has a fantastic write-up and talk on The Life of a Packet through a service mesh in a container environment. Also, here is a great history of the 1500 MTU.
These are some of the networking problems faced by container orchestration and simple NAT networks everywhere. IPC, or Interprocess Communication, is great within a single host, but bridging it to another host is tricky. It’s handy for every container to have its own IP address, because most legacy apps with a TCP listener can be easily containerized, but how do we pair up all of these private/public IPs and connect them only when necessary? The answer comes in the form of various proxies and overlay networks that use random DHCP IPs and port numbers. You might think of container or pod communication like this:
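To make the proxy hop concrete, here is a toy sketch in Python: an echo “service” on one loopback port, and a “proxy” on a second port that relays bytes to it, so the client’s message crosses the TCP stack twice before coming back. The names and the single-recv relay are mine and purely illustrative, not how any real mesh proxy is implemented.

```python
import socket
import threading

def echo_service(listener: socket.socket) -> None:
    """Accept one connection and echo back whatever arrives."""
    conn, _ = listener.accept()
    with conn:
        conn.sendall(conn.recv(4096))

def proxy(listener: socket.socket, target: tuple) -> None:
    """Accept one client and relay a single request/response to the service."""
    client, _ = listener.accept()
    with client, socket.create_connection(target) as upstream:
        upstream.sendall(client.recv(4096))   # client -> proxy -> service
        client.sendall(upstream.recv(4096))   # service -> proxy -> client

# Bind both on ephemeral loopback ports so the sketch is self-contained.
svc = socket.create_server(("127.0.0.1", 0))
prx = socket.create_server(("127.0.0.1", 0))

threading.Thread(target=echo_service, args=(svc,), daemon=True).start()
threading.Thread(target=proxy, args=(prx, svc.getsockname()), daemon=True).start()

with socket.create_connection(prx.getsockname()) as c:
    c.sendall(b"hello via proxy")
    reply = c.recv(4096)

print(reply)
```

Even in this trivial case, one logical message becomes two full TCP connections through the kernel, which is exactly the overhead the rest of this article is about.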
If that’s not complicated enough, have a look at iptables on a Kubernetes host running kube-proxy and an overlay network like Calico or Flannel. The good news is that routing these packets via the kernel’s internal iptables gives good performance while presenting a traditional IPv4 address to container developers. All the kernel needs to do is flip a few bits and checksum the packet header, as long as the packet is within all applicable MTU limits. There is no need to copy a full packet into an application in user space to process communications. The bad news is that all of this complex traffic doesn’t scale well. In fact, performance concerns highlighted the need to completely replace the kernel’s internal iptables structures with nftables. A Linux machine running 500 containers, managing complex routing with hundreds of internal/external IP addresses and possibly external load balancers, started to push the limits of what iptables was originally intended for. In essence, the complex IPv4 arrangements required for large scale Linux container workloads helped drive major changes to the kernel itself. Luckily, nftables was built to be so backwards compatible with classic iptables management that casual users didn’t notice.
Note that all of this network routing via the Linux kernel has only enabled connectivity. That says nothing about encryption or ACL protection. In fact, most overlay networks give each node a whole new bridge which must use a unique subnet within the network. This usually allows broad network-wide connections whether you want them or not: all or nothing. Also, don’t forget encryption is still up to you and your application. If you use a proxy or terminate TLS outside of this, the TCP flow will most likely be desegmented and re-assembled across the network, duplicating work at TCP layer 4.
If all of the above has bored you with tedium, then I’ve achieved exactly what I wanted to. When you use TCP on your IP to communicate there is a lot of overhead involved — especially with containers. Oftentimes your message gets wrapped into TCP segments or UDP datagrams, passed around to multiple inspectors, and then returned after it was determined that the destination was actually your local loopback adapter all along. Essentially, the kernel has a virtual network adapter for localhost communications just for local IPC using TCP. Then there’s MTU, which dictates the maximum size of a packet and defaults to 1500 on most interfaces. Loopback often uses a large MTU around 64000 to submit more at a time, but that must be reduced, or packets restructured, when meeting an interface with a lower MTU, which costs CPU time. Considering 10Gb and 100Gb networking, this is why we’re seeing a renewed interest in smart NICs and TCP offload for flow control, encryption, and compression. Note that since loopback is a software interface in the kernel, local traffic probably won’t benefit from any offloading on your fancy smart NIC.
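This is where UNIX domain sockets come in. The same local IPC over an AF_UNIX socket pair skips the whole apparatus above: the kernel just copies bytes between the two endpoints — no IP headers, no checksums, no MTU segmentation, and nothing for a NIC (smart or otherwise) to offload. A minimal sketch:

```python
import socket

# Two connected AF_UNIX stream endpoints in one process. On Linux/macOS
# this is plain kernel buffer copying -- the TCP/IP stack is never involved.
parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

parent.sendall(b"no TCP/IP form to fill out")
msg = child.recv(4096)
print(msg)

parent.close()
child.close()
```

Between unrelated processes, the same idea works by binding an `AF_UNIX` socket to a filesystem path, which also means access can be controlled with ordinary file permissions rather than firewall rules.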